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ABSTRACT 

Motivation: Genome assembly tools based on the de Bruijn graph 
framework rely on a parameter k, which represents a trade-off 
between several competing effects that are difficult to quantify There 
is currently a lack of tools that would automatically estimate the best k 
to use and/or quickly generate histograms of fc-mer abundances that 
would allow the user to make an informed decision. 

Results: We develop a fast and accurate sampling method that 
constructs approximate abundance histograms with a several orders 
of magnitude performance improvement over traditional methods. We 
then present a fast heuristic that uses the generated abundance 
histograms for putative k values to estimate the best possible value 
of k. We test the effectiveness of our tool using diverse sequencing 
datasets and find that its choice of fe leads to some of the best 
assemblies. 

Availability: Our tool KmerGenie is freely available at: 

http : //kmergenie ■ bx.p su.edu/ 
Contact: chikhi@psu.edu 



1 INTRODUCTION 

Genome assembly continues to be a fundamental aspect of high- 
throughput sequencing data analysis. In the years since the first 
methods were developed, there have been numerous improvements 
and the field is now rich with tools that pro vide biolog ists several 
options JRlbeiro et al.Ll2012llLuo et"aill20ll lFankevich et alil2013 , 



ChMii and Rizk, 2012. Pens et al.L I2012L ISimpson and Durbir . 
2011;. Zerbino and Bimev. 20081 . Many of these tools are based 



on the de Bruijn graph framework, where reads are chopp ed up 
into fc-mers (substrings of length k) IPevzner et al.L 1200111 . The 
de Bruijn graph is constructed with nodes being the (k — l)-mers 
and the edges being the fc-mers present in the reads. Broadly 
speaking, an assembler constructs the graph, performs various graph 
simplification steps, and outputs non-branching paths as contigs - 
contiguous regions which the assembler predicts are in the genome. 
Recently, there have been several meta-analyses of assemblers 
that have pointed to systematic shortcomings of current methods [Ea rl 
et al. l201lLISalzberget al.Ll201lLlAlkan et al.Ll201lLlBradnam et all 
I2OI3II. The Assemblathon competitions lEarl et al.Ll201lL Bradnam 
et al., 1201311 demonstrated that assembling a dataset still requires 
significant expert intervention. One issue is many assembler's lack 
of robustness with respect to the parameters and the lack of any 
systematic approach to choosing the parameters. In de Bruijn based 



assemblers, the most significant parameter is k, which determines 
the size of the fc-mers into which reads are chopped up. Repeats 
longer than fc nucleotides can tangle the graph and break up contigs, 
thus, a large value of fc is desired. On the other hand, the longer the fc 
the higher the chances that a fc-mer will have an error in it; therefore, 
making fc too large decreases the number of correct fc-mers present 
in the data. Another effect is that when two reads overlap by less 
than fc characters, they do not share a vertex in the graph and thus 
create a coverage gap that breaks up a contig. Therefore, the choice 
of fc represents a trade-off between several effects. 

Because some of these trade-offs have been difficult to 
mathematically quantify, there has not been an explicit formula 
for choosing fc taking into account all these effects. It is possible 
to calculate some bounds based on estimated genome size and 
coverage (e.g. by applying Lander- Waterman statistics), however, 
such estimates do not usually take into account the impact of 
repetitiveness of the genome, heterozygousity rate, or read error 
rate. In practice, fc is often chosen based on prior experience with 
similar datasets. More thorough approaches compare assemblies 
obtained from different fc values; however, they are very time 
consuming since a single assembly can take days for mammalian- 
size genomes. A more informed initial choice for fc can be made 
by building abundance histograms for putative values of fc and 
comparing them. The abundance histogram shows the distribution 
of fc-mer abundances (the number of occurrences in the data) for a 
single fc value. Such histograms can provide an expert with valuable 
information for choosing fc - however, the time to construct such 
a Mstogramcai^akeji£jo_^_da^forjustasing^ 
et al.. l2013llMarcaisandKingsfor(il20Ilh 

Our contribution in this paper is two-fold. First, we propose an 
accurate sampling method that constructs approximate abundance 
histograms with an order of magnitude speed improvement, 
compared to traditional tools based on fc-mer counting. Our method 
allows an expert user to make an informed decision by quickly 
generating abundance histograms for many fc values and analyzing 
the results, either visually or statistically. Our second contribution 
is a fast heuristic method for selecting the best possible value of fc, 
based on the generated abundance histograms for many values of fc. 
The heuristic is based on the intuition that the best choice of fc is the 
one which provides the most distinct non-eiToneous (genomic) fc- 
mers to the assembler. Our method can be integrated into assembly 
pipelines so that the choice of fc is made automatically without user 
intervention. 
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We implement our methods in a publicly available tool called 
KmerGenie. We test KmerGenie's effectiveness using three 
sequencing datasets from a diverse set of genomes: S. aureus, 
human chromosome 14, and B. impatiens. First, we find that our 
approximation of the histogram is very close to the exact histogram 
and easily separable from histograms for nearby values of k. Next, 
we judge the accuracy of KmerGenie's choice of k by assembling 
the data for numerous k values and comparing the quality of the 
assemblies. We find that KmerGenie's choice leads to the best 
assemblies of 5. aureus and B. impatiens, as measured by the contig 
length (NG50), and to a good assembly of chrl4 that represents a 
compromise between contig length and the number of errors. 



Algorithm 1 Compute approximate abundance histograms 
Input: An integer A: > 0, a set of reads _R, e > 0. 
Output: Approximate abundance histogram of fc-mers in R 
1 

2: 
3: 
4: 
5: 
6: 
7: 



Init empty hash table T, with default values of 0. 
for all fc-mers a; in i? do 

if Pe {x) — then 
T[x] = T[x] + 1 

end if 
end for 

Compute abundance histogram h^ of T 
Let h{i?) = e ■ ht (i) for each i 
Output h 



2 METHODS 

Our method can be summarized as follows. We start by generating 
the abundance histograms for numerous putative values of k. We 
then fit a generative model to each histogram in order to estimate 
how many distinct fc-mers in the histogram are genomic (i.e. error- 
free). Finally, we pick the value of fc which maximizes the number 
of genomic fc-mers. We now describe each step in detail. 

2.1 Building tlie abundance histograms 

Consider a multiset of reads R from a sequencing experiment. For 
a given value of fc, each read is seen as a multiset of fc-mers. For 
instance with fc = 3, the read ATAGATA is the multiset of five 
3-mers (ATA, TAG, AGA, GAT, ATA). By taking the union of all 
reads, a dataset of reads is also seen as a multiset of fc-mers. Each 
fc-mer is said to have an abundance, which is the number of times it 
appears in the multiset. A common function used to understand the 
role of fc is the abundance histogram. For a given abundance value i, 
the function tells the number of distinct fc-mers with that abundance. 
One way to calculate the abundance histogram is to first run a fc- 
mer counting algorithm. A fc-mer counting algorithm takes a set of 
reads and outputs every present fc-mer along with its abundance. The 
abundance histogram can then be calculated in a straightforward 
way. iv -mer counting is itself a well-studied problem with efficient 
solutions and tools, though eve n efficient implementations can take 
hours_orjdavs_on2arge_datasetsj|Ma^ais_Mi^ 



et al., I2OI3I1 . Such a solution would be inefficient for generating 
histograms for multiple values of fc since one would need to run the 
fc-mer counter multiple times. 

Instead, we propose to create an approximate histogram by 
sampling from the fc-mers, an idea explored in a more general 
setting bv lCormode et al.l 1200511 . The pseudocode of our algorithm 
is shown in Algorithm [T] Intuitively, we use a parameter e to 
dictate the proportion of distinct fc-mers we sample. We pick 
a hash function p^ : {A,C,G,T}'' -^ [0..e] that uniformly 
distributes the universe of all possible fc-mers into e buckets. In 
our implementation, we adopted a state-less 64 bits hash function 
(RanHash, page 352 in IPress et al.. 2007]). We then count the 
abundances of only those fc-mers that hash to 0. The abundance 
histogram is then computed from the fc-mer counts, scaling the 
number of fc-mers with a given abundance by e. 

The running time of the algorithm is 0(|_R|(^ — fc)), where (. is 
the read length. The expected memory usage is 0{m/e), where m is 
the number of distinct fc-mers in R. Though the asymptotic running 
time is the same as for an exact fc-mer counting algorithm, the total 



overhead of adding fc-mers to a hash table is reduced by a factor of 
e. Similarly, though the memory usage is asymptotically the same 
as for an exact fc-mer counter, the decrease by a factor of e can make 
it feasible to store the hash table in RAM. 



2.2 Generative model for the abundance histogram 

Given an abundance histogram, our next step is to infer the number 
of distinct genomic fc-mers in it. In principle, if we knew the error 
rate, we could easily estimate the number of genomic fc-mers (not 
necessarily distinct) as a proportion of the total number of fc-mers. 
However, such a simple approach does not allow us to estimate 
which of the fc-mers are genomic and hence does not allow us to 
estimate the number of distinct genomic fc-mers. Moreover, the error 
rate is itself a parameter that is not known prior to assembly, so it 
must be estimated as well. 

Instead, our approach is to take a generative model and fit it to 
the histogram. We can then infer the number of distinct genomic 
fc-mers from the parameters of the model. Fitting a model to a 
histogram has been previously explored in the context of error- 



explored 

correction llChaisson and Pevznej. boOSL JKellev et al.Ll2oToll . We 

lev et alj 1^ 



adopt the model proposed bv lKellev et alj I201C11 , which we describe 
here for completeness. 

2.2.1 Haploid model The fc-mer abundance histogram is a 
mixture of two distributions: one representing genomic fc-mers, and 
one representing erroneous fc-mers. We use the term copy-number 
to denote the number of times a genomic fc-mer is repeated in the 
genome. A genomic fc-mer distribution is itself a mixture of n 
Gaussians, each Gaussian coiTesponding to fc-mers with a copy- 
number 1 < i < n. We fix the maximum copy-number to n = 30. 
For each copy-number i, the mean pt and variance u^ of the 
Gaussian are different values (due to Illumina biases), proportional 
to the copy-number. Thus in our model, the mean and variance 
(/ii , al ) (for copy-number 1) are free parameters, and the remaining 
means and variances are fixed to pi = i/ii and a^ — ia^. The 
weights of the mixture of Gaussians are given by a Zeta distribution, 
which has a single free shape parameter s. The erroneous fc-mers 
distribution is modeled as a Pareto distribution with fixed scale of 
1 and a free shape parameter a. The mixture between erroneous 
and genomic fc-mers is weighted by a free parameter pe, which 
corresponds to the probability that a fc-mer is erroneous. Thus in 
total, our model has five free parameters (pi, erf , s, a,pe). 



2.2.2 Extension to the diploid model In the diploid case, we say 
that a fc-mer is homozygous if it appears in both alleles, and is 
heterozygous otherwise. We model the genomic fc-mers of a diploid 
organism as a mixture of two haploid genomic fc-mer distributions 
Dht and Dhm. A mixture parameter (ph) controls the proportion 
of homozygous fc-mers (drawn from Dhm) to heterozygous fc-mers 
(drawn from Dht)- The parameters of the two distributions remain 
free, with the following exception. As homozygous fc-mers are 
expected to be sequenced with twice the coverage of heterozygous 
fc-mers, the mean of Dhm is fixed to twice the mean of Dht- As 
in the haploid case, we model the erroneous fc-mers as a Pareto 
distribution with parameter a and a mixture proportion of Pe- In 
summary, the diploid model has eight free parameters: the variances 
(af) and the Zeta shapes (s) for the Dhm and Dht distributions; 
additionally, the mean (/ii) of Dht, the Pareto shape (a), the error 
probability (pe), and the heterozygosity proportion (ph). 



2.3 Fitting the model to the histogram 

Given the abundance histogram, we can estimate the parameters 
of the above model that would be the best fit for the obse rved 
histogram. Similarly to what is done in iKellev et alj I2010h . we 
do a maximum-likelihood estimation of the parameters using the 
optim function in R (BFGS algorithm). Let d be the number of 
distinct fc-mers present in the histogram. Let pe be the estimated 
mixture parameter between the erroneous and genomic fc-mers in 
the above model. Immediately, p^d is an estimate of the numbers 
of erroneous fc-mers and (1 — p^)d is an estimate of the number of 
genomic fc-mers present in the reads. 

2.4 Finding the optimal k 

Our key insight is that the best value of fc for assembly is the 
one which provides the most distinct genomic fc-mers. To see this, 
consider the number of distinct fc-mers in the reference genome. 
Observe that, as fc increases, this number also increases and 
approaches the length of the genome, as a consequence of repeat 
instances becoming fully spanned by fc-mers. Thus, a high number 
of distinct genomic fc-mers allows the assembler to resolve more 
repetitions. In the ideal scenario of perfect coverage and error-free 
reads, the best value of fc for assembly would be the read length. 
However, the read coverage is typically imperfect and reads are 
error-prone, requiring a more nuanced approach. 

First, consider obvious high and low thresholds: for fc close to 
the read length, it is unlikely that all the fc-mers in the reference are 
present in the reads due to imperfect coverage. On the other hand, 
note that any genome will contain all fc-mers for a small enough fc 
(e.g. the human genome contains all possible 4-mers). This is due to 
the fact that a significant chunk of a genome behaves like a random 
string. 

Next, we examine the values of fc between these two thresholds. 
Essentially, two effects are competing. The shorter a fc-mer is, the 
more likely it is (i) to appear in the reads, but also (ii) to be repeated 
in the reference. For a typical sequencing depth and values of fc 
near the read length, it is likely that only a small fraction of the 
fc-mers from the reference genome appears in the reads (effect (i)). 
However, unless the sequencing depth is insufficient for assembly, 
there exists of a largest value fco at which nearly all the fc-mers in 
the reference genome are present in the reads for fc < fco. Thus, 



decreasing fc below fco only contributes to making more fc-mers 
repeated (effect (ii)). 

From these observations, we conclude that the number of distinct 
genomic fc-mers in the reads is likely to reach a maximum value. 
At this value, all the fc-mers in the reference genome are likely to 
appear in the reads, thus the assembly at this fc-mer length nearly 
covers all the genome. Also, the assembly is likely to be of high 
contiguity as a large number of distinct fc-mers imply that more 
repetitions are fully spanned by fc-mers. 



3 RESULTS 

3.1 Datasets and assemblers 

We benchmarked KmerGenie using data from the Genome 
Assembly Gold-standard Evaluation (GAGE), whic h was previously 
used to evaluate and compare different assemblers OSalzberg et al.L 
uOlin . We used three datasets from three genomes of different 
sizes: S. aureus (2.8 Mbp), human chromosome 14 (88 Mbp) and 
B. impatiens (250 Mbp). The datasets contain 5mil/62mil/497mil 
lUumina reads of length 101bp/101bp/124bp (respectively). The 
coverages are 167x/70x/247x. 

For each dataset, the GAGE study published the assembler that 
produced the best results, along with its most effective formula 
(including commands and parameters). To assess how our predicted 
fc values relate to the quality of assemblies, we choose, for each 
dataset, the de Bruijn assembler and formula that produced the best 
results i n the GAGE study. For 5. a ureus and chrl4, this was Velvet 
1.2.08 iZerbino and Bimevl 1200811 with the parameters given on 
the GAGEwebsitej_^orSj_imgari«j5j_^e_^se^^OAPdenOT 
et al.. l2012ll using the GAGE recipe and no additional parameters. 
All experiments were run on a 32-core machine (Xeon E7-8837 @ 
2.67 GHz) with 512GB RAM. 



3.2 Performance of KmerGenie's choice of k 

We ran KmerGenie on all three datasets with putative fc values of 
21, 31, 41, 51, 61, 71, 81 and a sampling frequency of e = 1000. 
The optimal values of fc were predicted to be 31 for S. aureus, 
71 for chrl4, and 51 for B. impatiens. We then assembled each 
dataset using both the optimal and other reasonable values of fc. 
The results of each assembly are shown in Table [T] The decision 
of what constitute s a "best" assembly is complicated d ue to the 
inher ent trade-offs llSalzberg et al.ll201 lllEarl et al.ll201 iL Bradnam 
et al., I2OI3I1 . We therefore evaluated each assembly using three 
common quality measures: the contig NG50 length, the assembly 
size, and the number of assembly errors. Contig NG50 is defined as 
the length at which half of the predicted genome size is contained in 
contigs longer than this length. Contigs were obtained by splitting 
reported scaffolds at each undetermined nucleotide. The size was 
measured as the sum total length of contigs larger than 500 bp. The 
Error col umn reflects the numbe r of mis-joins called by the QUAST 
software JGurevich et al.Ll2013ll . Note that for B. impatiens there is 
no reference available and hence it is not possible to measure the 
number of errors. QUAST reports an assembly error as a position 
in the assembled contigs where one of the following mis-assembly 
events occur: (i) the left flanking sequence aligns over 1 kbp away 
from the right flanking sequence on the reference, or (ii) they 
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Assembly Contig NG50 (Kbp) Size (Mbp) EiTors 



S. aureus 


(Velvet) 






fc = 21 


0.5 


7.65 





fe = 31 


19.4 


2.83 


10 


fc = 41 


11.7 


2.81 


6 


fc = 51 


4.6 


2.80 


9 


chrl4 (Velvet) 






fc = 41 


2.4 


74.56 


764 


fc = 51 


4.0 


79.92 


843 


fc = 61 


5.4 


82.10 


431 


fc = 71 


4.7 


81.89 


251 


fc = 81 


1.8 


74.18 


153 


B. impatiens (SOAPdenovo2) 






fc = 41 


5.4 


224.05 




fc = 51 


10.4 


229.71 




fc = 61 


9.5 


230.36 




A: = 71 


5.9 


226.11 




A: = 81 


2.5 


207.11 





Table 1. Quality of assemblies for different values of k. The value of k 
predicted by KmerGenie is underlined. 



overlap by > 1 kb, or (iii) the flanking sequences align on opposite 
strands or different chromosomes. 

For 5. aureus, the assembly with the chosen k had the best NG50 
and size, but had more errors than other assemblies. For chrl4, the 
assembly with the chosen k did not have the best NG50 or size 
(though it was close), but did have significantly less errors. For B. 
impatiens, the chosen k gave an assembly with the best NG50 and 
second best size (the number of errors is unknown since there is 
no reference). Overall, KmerGenie's choice of k led to the best 
assemblies of S. aureus and B. impatiens, as measured by the NG50 
and assembly size, and to a good assembly of chrl4 that represents a 
compromise between NG50/assembly size and the number of errors. 

KmerGenie is multi-threaded, it uses one thread per k value. For 
a single thread, the running time and memory usage of KmerGenie 
is shown in Table[2] We also compared the speed of our approximate 
histogram generation to what could be achieve d by the exact 
A;-mer counting method DSK iRizk et all l2013ll . KmerGenie 
ran 6-10 times faster than DSK, confirming that our sampling 
approach leads to significant speed-ups. For the largest dataset 
(B. impatiens), the total wall-clock time of KmerGenie with 
A:-mer set {21,31,41,51,61,71,81} using 7 threads is 3 hours. 
Note that assembling B. impatiens using SOAPdenovo2 requires 
approximately 30 CPU hours for a single value of A (10 wall-clock 
hours, 8 threads). 

3.3 Comparison with VelvetOptimizer and 
VelvetAdvisor 

We are aware of only two other methods that have been proposed 
to optimize k. VelvetOptimizer (abbreviated VO) is an unpublished 
tool that attempts to optimize k by performing a Velvet assembly 
for each odd k value between 19 and 79 and picking the one that 
yields the highest scaffold N50 (Seemann, http : //dna .med. 



Organism 


CPU time 
DSK KmerGenie 


Memory usage of 
' KmerGenie (GB) 


S. aureus 

chrl4 

B. impatiens 


2min 
48min 
7. 5 hour 


Usee 
7min 
1.2hour 


0.1 
0.1 
0.4 



Table 2. Resource utilization of KMERGENIE compared to a fe-mer 
counting based approach (DSK). We executed KMERGENIE and DSK for 
a single value of k (81) using one thread. KMERGENIE was executed with a 
sampling frequency of e = 1000. DSK used 5 GB of memory. 



[monash . edu . au/ ~torsten/velvet_advisorl ). It then 
determines coverage cut-off parameters that yield the longest 
assembly in contigs longer than 1 kbp. Since this requires doing 30 
assemblies, using VO on the chrl4 data would require close to half 
a CPU-year. We therefore were not able to evaluate VO on chrl4 or 
the even larger B. impatiens. 

We did execute VO on the S. aureus dataset, using the default 
optimization parameters. Each Velvet assembly requires around 40 
minutes of CPU time, resulting in 20 hours computation (though 
this can be parallelized). VO selected the assembly with k = 41, 
with expected coverage value of 12, and cut-off value of 6.47. 
Compared to the assembly based on KmerGenie's choice of A = 
31 (shown in Table [TJ, VO's assembly has slightly higher size 
(2.85kbp vs 2.83kbp), significantly lower contig NG50 (11.6kbp 
vs 19.4kbp), and significantly more errors (23 vs 10). The VO 
assembly has a higher scaffold N50 (734 kbp versus 257 kbp), 
but this high N50 value may be misleading; QUAST reports a 
NGA50 value (corrected scaffold NG50) of 11.6 kbp for the VO 
assembly, and 110.2 kbp for KmerGenie assembly. We conclude 
that KmerGenie's choice of k leads to a better assembly than VO's 
in this case. We note, however, that we believe the main advantage 
of KmerGenie over VO's approach is that it is orders of magnitude 
faster (2mins vs 20 hours) and is applicable even when VO is not 
feasible (e.g. chrl4 and B. impatiens). 

Another method is the unpublished tool Velvet Advisor 
(Gladman and Seemann, http: //bioinf ormatics .net . 
au/ software . velvet opt imiser . shtml) . It uses a formula 
to recommend a value of A, given the number of reads, the 
read length and the estimated genome length. Velvet Advisor 
recommends A = 81 for the 5. aureus dataset, k = 51 for the 
cIirM dataset and A = 85 for the B. impatiens dataset. The A value 
for the chrl4 dataset leads to a good assembly, but, for S. aureus and 
B. impatiens, the assemblies using these values are very poor. 



3.4 Effect of sampling and the fit of the statistical model 

Since the main purpose of the histograms is to contrast the 
differences between different A values, we measure the accuracy of 
our approximate histogram by comparing it at a fixed k value (51) 
to the exact distribution of A = 51 and the exact distributions of 
nearby fc (41 and 61). The results for our three datasets are shown 
in Figure [T] We observe that the sampled histogram closely follows 
the exact one and easily discriminates between other A values when 
such a discrimination is possible from the exact counts. 
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Fig. 1. The accuracy of the samphng method. The panels reflect the three datasets: S. aureus (a), chrl4 (b), and B. impatiens (c). Each plot show the exact 
histogram curves for fc = 51 (solid black curve), fc = 41 (dash-dot red curve), and fc = 61 (dashed green curve). The approximate (sampled) histogram is 
shown using black dots. Note that y is shown on a log-scale, exaggerating the differences at lower y values. 



We illustrate the effect of k on the abundance histogram for chrl4 
in Figure|2] In this case, the histogram is dominated by a mixture of 
a distribution for erroneous fc-mers and one for genomic fc-mers. As 
k increases, the genomic distribution shifts left and becomes more 
narrow, resulting in a larger overlap with the eiToneous distribution. 

Figure |2] also shows the fit of our model to the histogram. 
Even though the human genome is diploid, its heterozygosity rate 
is small enough that we model the fc-mer abundance histogram 
using the haploid statistical model. The high copycounts appear 
to be inaccurately fitted, but note that the log-scale amplifies the 
difference in low abundances. For B. impatiens, on the other hand, 
the haploid model does not lead to a good fit, likely due to a 
possibly higher polymorphism rate. Figure |3] shows the difference 
in fit between using a haploid and diploid model for B. impatiens. 

3.5 Relation of the number of distinct genomic fc-mers 
to assembly quality 

An important component of our method is the prediction of the 
number of distinct genomic fc-mers from the abundance histogram. 
Our underlying assumption is that providing the assembler with 
more distinct genomic fc-mers leads to more of these fc-mers being 
used in contigs and to longer contigs. To measure this effect, we 
plot (as a function of fc) the number of distinct genomic fc-mers 
predicted from the histogram, the number of distinct fc-mers used in 
the assembly, the NG50 of the assembly, and the number of distinct 
fc-mers in the reference (Figure |4]. 

For S. aureus and chrl4, the number of distinct genomic fc- 
mers in the assembly approximatively mirrors the number of ones 
predicted in the input, with the exception of extreme fc values. For 
B. impatiens, the variations of the predicted number of fc-mers do 
not match the variations of the number of distinct fc-mers present 
in the assemblies. We postulate that this discrepancy may be due to 
heterozygosity, and note that the agreement between NG50 and our 
prediction is sufficient for our purpose. 

For all three organisms, the NG50 rises and falls in accordance 
with the number of predicted distinct genomic fc-mers. We also 
observe that KmerGenie overestimates the number of distinct 



genomic fc-mers when compared to the reference fc-mers. A part 
of this is likely due to heterozygosity, which is not captured in the 
haploid reference. However, it may also partially indicate room for 
improvement in our statistical model and/or optimization. 

For the lowest values of fc in the S. aureus and B. impatiens 
datasets, the assemblers produce a much larger assembly than 
expected. We conjecture that this is due to a mis-estimation in the 
assemblers of what constitutes an eiToneous fc-mer (since with low 
fc-mer sizes, erroneous fc-mer have higher abundance than with high 
fc-mer sizes). We verified this conjecture in the 5. aureus dataset, 
by manually assembling 5. aureus with Velvet using fc = 21 and 
a larger coverage cut-off value forced to 7 (experimentally found). 
The new Velvet assembly size is 2.8 Mbp, which is much closer 
to the reference size than the 7.65 Mbp assembly with automatic 
coverage cut-off. Thus, this indicates that larger assemblies are 
artifacts made by assemblers rather than an actual increase of 
genomic fc-mers. 



4 DISCUSSION 

While we have presented a method that attempts to find the best 
value of fc for assembly, we would like to note several limitations 
inherent in this approach. First of all, KmerGenie may in some 
instances report that a best value of fc cannot be found because it is 
not able to fit the generative model to the abundance histograms. 
This could simply be due to a limitation of our model or the 
optimization algorithm, but it could also be due to a difficulty 
inherent in the data. For example, data from sin gle cell experiments 
have uneven coverage JChitsaz et al.L uOllll , violating a basic 
assumption of our model. Similarly, data from metagenomic or 
RNA-seq experiments do not come from a single genome and 
their histograms have different properties. In these cases, it has 
been observed that there is often no single best fc and and that 
combining the assemblies from different fc can be beneficial. 
Though KmerGenie does not suggest a fc in these cases, it provides 
the abundance histograms that can be useful in determining the best 
assembly approach. 
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Fig. 2. The abundance histograms for chrl4 with k values of 21, 41, and 81 (on a y log scale). Each plot also shows a curve corresponding to the optimized 
statistical model (haploid). 
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Fig. 3. The abundance histogram and optimized model for B. impatiens, k = 41, using a haploid (left) and a diploid model (right), on a y log scale. In both 
graphs, the black histogram curves are the actual fc-mer histogram, and the red (solid) curve is the maximum likelihood fit using our model. In the diploid 
model graph, the green (dot-dashed) curve models the heterozygous fc-mers, and the blue (dashed) curve models the homozygous fc-mers. Other components 
of the mixture are not shown. 
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Fig. 4. Relation of the number of distinct genomic fc-mers to assembly quality. We show the results for the three datasets: 5. aureus (left), chrl4 (middle), B. 
impatiens (right). We plot the number of distinct genomic fc-mers predicted from the histogram from our model, the number present in the reference, and the 
number present in the assembly. We also show the NG50 of the assembly. 



We have demonstrated our approach to be useful for de Bruijn 
based assembl£rej_Other_assemblers,_^uc^_as_S^^ and 

Durbin. uOUll . follow the alternate string overlap graph approach, 
in which reads are not chopped up into fc-mers. These assemblers 
do not have the k parameter but do have an alternate parameter for 
the minimum length of a non-spurious overlap. Though a formal 
relation between these parameters has not been established, they 
play a similar role in affecting the assembly results. We therefore 
consider it an interesting direction for future research to extend 
our approach to select the best overlap parameter for string overlap 
graph assemblers. 

Finally, we wish to emphasize that our benchmark did not attempt 
to produce the best possible assembly for each organism. Rather, 
we are restricting ourselves to what a "typical" user might do with 
the data: run a single assembly software on un-corrected data, 
and possibly try several fc-mer values. In order to get the best 
possible assembly, one would have to explore the Cartesian product 
of several assemblers, several read error-correction methods, and 
several fc-mer values, which is often a prohibitively long task. 
Notably, because we did not select the best error-correction method 
for each assembler/organism, the assemblies reported in Table [T] 
have lower contig N50 (and also scaffold N50, data not shown) than 
those reported in the GAGE benchmark. 

There are improvements that we have left for future work. The 
first direction is to determine how our method could be applied to 
non-uniform coverage. While a single best k value for metagenome 
and transcriptome assembly is unlikely to exist, perhaps useful 
information could be extracted from the histograms constructed on 
such datasets. The second direction is to explore ways of improving 
the accuracy of our statistical model, potentially leading to more 
accurate estimates of the number of distinct genomic fc-mers. 
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